Metabolic function-based normalization improves transcriptome data-driven reduction of genome-scale metabolic models

Genome-scale metabolic models (GEMs) are extensively used to simulate cell metabolism and predict cell phenotypes. GEMs can also be tailored to generate context-specific GEMs, using omics data integration approaches. To date, many integration approaches have been developed, however, each with specific pros and cons; and none of these algorithms systematically outperforms the others. The key to successful implementation of such integration algorithms lies in the optimal selection of parameters, and thresholding is a crucial component in this process. To improve the predictive accuracy of context-specific models, we introduce a new integration framework that improves the ranking of related genes and homogenizes the expression values of those gene sets using single-sample Gene Set Enrichment Analysis (ssGSEA). In this study, we coupled ssGSEA with GIMME and validated the advantages of the proposed framework to predict the ethanol formation of yeast grown in the glucose-limited chemostats, and to simulate metabolic behaviors of yeast growth in four different carbon sources. This framework enhances the predictive accuracy of GIMME which we demonstrate for predicting the yeast physiology in nutrient-limited cultures.


INTRODUCTION
To illustrate the genotype-phenotype relationship of metabolic phenotypes in an environment, we require a map of genomescale biochemical reactions and their comprehensive connections. The GEnome-scale Metabolic models (GEM) provide a mathematical framework to gain understanding into metabolic capacity of a cell: enable system-wide analysis of genetic perturbations and metabolic engineering, identify the constraints that the chemical interactions operate under, explore metabolic diseases, as well as to find essential enzymatic reactions 1 .
Following the introduction of GEMs, a new challenge arose: the integration of omics data into a GEM for a better prediction of the metabolic functionalities. The conjunction of gene expression data and GEMs leads to a deeper understanding of the occurrence of certain changes in different conditions and creates condition-and context-specific models 2 . Since Covert et al. presented the first conceptual framework on transcriptional regulation of metabolic models in 2001 3 , several approaches have been developed to investigate how the integration of gene expression could affect the content and predictive accuracy of a GEM [4][5][6] . These frameworks, which either are based on a new algorithm or are modifications of the previous frameworks, differ in their assumptions and mathematical formulations. All currently available integration methods are listed in Table 1. With the increasing interest in these integration approaches, multiple publications have also focused on the benchmarking of the methods used to generate context-specific models, and evaluated the advantages and disadvantages of different approaches 2,7 . Currently, none of these algorithms systematically outperforms the others, as each of them has specific pros and cons depending on the type of data available to tailor the GEM 2 .
One of the main challenges that limits overall predictive capability of these methods is setting different parameters e.g., gene expression threshold to find significantly differential expressed genes 7 . As enzymes have different expression levels and activities, determination of whether the protein is expressed or not by applying a uniform threshold for all expression data is suboptimal 8 . Here, we propose tackling this problem by transforming the transcription data to a higher-level space (gene sets instead of genes) which is a more biologically relevant set of features, which then can be integrated into a GEM. Critical for such transformation is the selection of the annotation set used to classify genes to respective gene groups: here we used a highquality, manually derived set of annotations, which comprised mainly metabolic proteins. We therefore combined existing methods for transcriptional data integration into a GEM with a data normalization routine in order to more accurately predict the metabolic phenotypes of a model organism Saccharomyces cerevisiae. In general, our framework can be combined with all developed integration methods, however, here we illustrate our approach using the Gene Inactivity Moderated by Metabolism and Expression (GIMME) 9 algorithm.
We first analyzed the transcription data by ssGSEA, an extension of Gene Set Enrichment Analysis (GSEA) 10 , to calculate enrichment scores (ES) for each annotated gene set which associated to the particular biological processes or pathways. Actually, ES represents the degree and ranks the genes in a particular gene set according to the values of expression. Thus, we reconciled the enrichment score with the GIMME algorithm. We show that the models, generated using normalized expression data, outperform the accuracy of standard GIMME models in certain cases. 2012 H. sapiens mCARDE uses the same framework as MBA, but with a different formulation for non-core reactions. mCARDE ranks non-core reactions according to their expression as well as weighted connectivity in the given network, and then keeps reactions which are required for model consistency.  In this study, the predictive accuracy of context-specific metabolic models was evaluated using their ability to predict metabolic fluxes under different scenarios. The prediction accuracy was determined by comparing the predicted flux values with experimentally measured values and was evaluated using statistical metrics. Additionally, the accuracy of essential gene prediction, growth rates and other phenotypic traits under different environmental conditions could also be taken into consideration.

RESULTS
Predicting growth in glucose-limited chemostats We evaluated and compared context-specific GEMs of S. cerevisiae built by different methods. First, we assessed their performance using the experimental data from glucose-limited chemostats at varying dilution rates. For each medium dilution rate of the chemostat, we generated respective context-specific models (using both GIMME and the combined method, ssGSEA-GIMME) using the respective RNA-seq data. Then we benchmarked the performance of these methods in predicting both intracellular and extracellular metabolic fluxes, based on the experimentally determined fluxes of glucose and maximal oxygen consumption by S. cerevisiae cultures.
We first compared the flux balance analysis (FBA) predictions of the main exchange fluxes for both the conventional Yeast8.4.2 and the context-specific models (Fig. 1a-c). All three methods correctly predicted fluxes in the range of growth rate μ < 0:25h À1 , as the growth in this range is limited only by the substrate availability. We also checked the flux intervals that would result in the same value of the objective function (proxy for the specific growth rate) using flux variability analysis (FVA). However, we noticed only minor changes in the sums of flux variability intervals when comparing GIMME and ssGSEA-GIMME.
Under glucose-limited conditions, S. cerevisiae exhibits fullyrespiratory energy harvesting until a certain growth rate, the socalled critical growth rate μ crit , is reached. For growth with rates above μ crit , ethanol formation is detected. For the wild-type S. cerevisiae, the value of μ crit ¼ 0:28h À1 for glucose-limited chemostats was determined by a number of studies 11,12 . As expected, both the conventional GEM and context-specific models predicted ethanol formation above the μ crit , with predicted values of μ crit ¼ 0:273 h À1 for the conventional GEM, and μ crit ¼ 0:253 h À1 and μ crit ¼ 0:272 h À1 for GIMME and ssGSEA-GIMME, respectively. From the modeling perspective, it was shown that for this to happen, a second global constraint (the first being the limitation of the substrate, and the second being limited oxygen uptake) has to be defined in the model (see 13 for more details). Yet GIMME predicted ethanol formation at a lower growth rate than the experimental data suggested. Meanwhile, both the conventional GEM and ssGSEA-GIMME captured the onset of ethanol formation correctly, in agreement to the experimentally determined critical growth rate.
Furthermore, we compared the predictions of intracellular fluxes at μ ¼ 0:30h À1 for all three model implementations with previously published literature data (Fig. 1d). Unsurprisingly, when taking into account all the flux predictions in the metabolic network, both context-specific models showed moderately improved predictions compared to the conventional GEM ( Fig. 1e-g). Both GIMME and ssGSEA-GIMME models showed rather similar flux profiles (Fig. 1f, g), in agreement with the similar exometabolite flux predictions in Fig. 1b, c. Taking together the results of these two tests, we suggest that ssGSEA-GIMME indeed has potential to refine the predictions of metabolic fluxes better, compared to GIMME. Importantly, ssGSEA-GIMME performed better in predicting the critical growth rate of ethanol formation for glucose-limited cultures of S. cerevisiae.
Testing flux predictions of growth on different substrates We achieved only moderate improvement on predicting growth on glucose-limited chemostat cultures with ssGSEA-GIMME, compared to GIMME alone. Therefore, we wanted to determine whether the improvement (or lack thereof) of flux predictions depends on the growth condition (nutrients supplied). To that end, we further explored the predictive capabilities of the contextspecific models by using the previously published data 14 . In the study, S. cerevisiae was grown in four different carbon sourcelimited chemostats (glucose-, maltose-, ethanol-, and acetatelimited cultures) at the dilution rate D ¼ 0:1h À1 .
Evaluation of the determination coefficient for the predictions of the computational methods (Fig. 2) suggested that, compared to the conventional GEM, GIMME alone improved predictions for all carbon sources but ethanol. For growth on glucose and acetate, the context-specific models, with and without applying ssGSEA, showed very similar predictions. We saw similar effects when taking into consideration the fluxes in central carbon metabolism ( Fig. 2; see Figure S1 for more details).
he low accuracy of the conventional GEM for predicting glucose-limited growth (Fig. 2a) was, among other fluxes, due to unreasonably high fluxes through succinate dehydrogenase, surpassing the experimentally determined value by some 6-fold ( Figure S1). Contrary to the conventional GEM, both GIMME and ssGSEA-GIMME models showed improved predictions for the flux through succinate dehydrogenase.
Both GIMME and ssGSEA-GIMME showed incremental improvement in predicting fluxes for growth on maltose and acetate (Fig. 2b). For growth on ethanol, only ssGSEA-GIMME showed improvement of the flux predictions, compared to the conventional GEM. To our surprise, predictions of the model, generated with GIMME alone were worse than these of the conventional GEM. Based on our observations, here we conclude that the improvement of context-specific models by data normalization indeed is dependent on the condition analyzed. While normalization of the expression data, using, among other methods, ssGSEA, remarkably improves the predictions for some growth conditions, prediction of quantitatively-sound flux distributions are still very cumbersome for some growth conditions.

DISCUSSION
Since 2001, different frameworks have been developed or modified to investigate how gene expression integration into GEM could influence model content and increase its predictive accuracy.
Integrating gene expression into fluxes is intrinsically hindered by the assumption that "the mRNA transcript level is correlated with enzyme activity", which should be a systemic property of metabolism. In addition, thresholding is a crucial step in gene expression analysis, as it helps to identify and filter out genes that are not differentially expressed, thus reducing the amount of noise in the data. There are different types of thresholding methods, such as standard deviation thresholding (STANDEP) and local T2, which have been used in gene expression analysis. However, these thresholding approaches have limitations, particularly when it comes to detecting subtle changes in gene expression, especially in complex systems such as metabolic networks.
Therefore, here we retrieve a functional profile of the gene sets (a systemic mode), in order to better integrate the underlying metabolic processes. Our central hypothesis is that gene set enrichment as a pre-processing step can increase predictive accuracy of context-specific models. This approach differs from traditional thresholding methods in that it focuses on the functional enrichment of gene sets rather than individual genes, making it more robust in detecting subtle changes in gene expression. ssGSEA works by normalizing gene expression data Fig. 1 Characterization of the predictions of context-specific models for growth of S. cerevisiae under glucose-limited conditions. a-c Flux predictions of metabolite exchange (glucose, oxygen, ethanol and carbon dioxide) for changing dilution rates for glucose-limited chemostats by the conventional Yeast8.4.2 a and the context-specific models, created using GIMME b and ssGSEA-GIMME c. Experimental data, triangles taken from ref. 12 and circles from ref. 22 . d Comparison of experimentally determined metabolic fluxes of the central carbon metabolism (E) and predictions by the conventional GEM (M), GIMME (G) and ssGSEA-GIMME (S) model at μ = 0.30 h -1 . e-g Correlation between the experimental data (X-axis) and model predictions (Y-axis) for different model implementations. Experimental data in d and on the X-axes of e-g taken from ref. 57 . Flux values were normalized to that of hexokinase reaction (v HEX rel = 1). Glc glucose, Mal maltose, G6P glucose 6phosphate, F6P fructose 6-phosphate, 6PGL 6-phosphogluconolactone, F16BP fructose 1,6-bisphosphate, DHAP dihydroxyacetone phosphate, GAP glycerol aldehyde 3-phosphate, G13BP  based on the gene's metabolic function, which allows for a more accurate representation of the underlying biological processes (Fig. 3).
To evaluate this proposed framework, we combined ssGSEA with the GIMME algorithm and applied it to two gene expression datasets; yeast grown in glucose-limited chemostats at different dilution rates and growth of S. cerevisiae in four different carbon source-limited chemostats at a constant dilution rate. This is a first attempt to integrate normalized (based on metabolic function) gene expression data into GEM, however, the framework could be improved by considering other enrichment analysis approaches 15 . However, our attempts to apply to generate context-specific models using INIT 16 or iMAT 17 resulted in models unable to predict biomass formation. While this issue could be mitigated by supervised generation of context-specific models (i.e. manually defining the reactions which should not be removed during generation), such a need severely hampers both usability and usefulness of models generated by these methods.
Our assessment of the predictions by GIMME and ssGSEA-GIMME shows that the combined framework performs well under all scenarios, outperforms conventional GIMME in some scenarios, and shows promise in its ability to improve the performance of integration approaches.
Conclusively, our findings suggest that ssGSEA has the potential to improve the accuracy of metabolic models, particularly when integrated with the GIMME algorithm. While this is a novel idea and requires further validation with more data, it highlights the importance of considering functional enrichment in gene expression analysis, and the potential benefits of using a method like ssGSEA.
While our approach has shown promising results, it is important to acknowledge some of the limitations of our study.
First, the sample size in our analysis was limited; therefore, further validation on larger sample sizes is necessary to confirm the robustness of our findings. Additionally, we were limited in our choice of data sources, and further research could explore how our approach performs with a broader range of data.
Second, the importance of carefully considering the impact of thresholding on the accuracy of context-specific metabolic models. The integration of transcriptomic data into a genomescale metabolic model plays a critical role in studying cellular metabolism and developing context-specific metabolic models. The thresholding of gene expression data significantly impacts the accuracy of these models. Thresholding refers to setting a minimum threshold for gene expression levels and determining which genes are included or excluded from the models. This process is used to reduce the impact of noise in the data and to increase the accuracy of the models. However, the impact of thresholding on the accuracy of context-specific metabolic models is a complex issue that requires a comprehensive understanding of the interplay between gene expression data and metabolic pathways.
Three key papers in this field include "StanDep: Capturing transcriptomic variability improves context-specific metabolic models" by Joshi et al. 18 , "Assessing key decisions for transcriptomic data integration in biochemical networks" by Richelle et al. 19 , and "Guidelines for extracting biologically relevant context-specific metabolic models using gene expression data" by Gopalakrishnan et al. 20 . These papers provide valuable insights into the best practices for integrating transcriptomic data into metabolic models and the impact of thresholding on their accuracy.
The ssGSEA-based method that we developed represents a new approach to investigating the impact of thresholding on the accuracy of context-specific models. Our method is distinct from other approaches in the field, as it utilizes ssGSEA to extract biologically relevant information from gene expression data. This innovative approach brings several benefits to the field, including improved accuracy and efficiency, as demonstrated by our results.
Additionally, the core reaction sets generated using ssGSEA may differ from those generated using other local thresholding approaches. This difference may arise due to the different criteria used by ssGSEA to identify gene sets for reduction. As for handling lowly expressed genes with a low-flux metabolic activity, our method does not inherently exclude these genes from the final model. Instead, the model reduction process performed by ssGSEA focuses on the metabolic function of genes and reactions as well as gene ontology rather than their expression level or flux activity. In this way, the reduction process can capture the most relevant metabolic pathways, regardless of the expression or flux activity of individual genes. This approach helps to mitigate the effects of gene expression variability and ensures that the core reaction sets generated are biologically relevant. Our results demonstrate the potential of our method to provide a deeper understanding of the impact of thresholding on the accuracy of context-specific models. Our findings are important for advancing the field, as they shed light on how ssGSEA can be utilized to extract biologically relevant information from gene expression data.
We acknowledge that our method has limitations and that there is room for improvement. However, we are confident that our results provide a solid foundation for future work in this area and represent a step forward in developing context-specific models.

MATERIALS AND METHODS
An overview of the conceptual framework for transcriptional data integration into a GEM To devise and validate our new framework for integrating transcriptomics data into a GEM, we have combined two previously developed algorithms (ssGSEA and GIMME) to increase the accuracy of metabolic functionalities of GEMs.
ssGSEA, a gene set enrichment analysis approach Gene Set Enrichment analysis (GSEA) is a functional genomic analysis, which uses the pre-defined set of genes/proteins across highthroughput data and determines statistically significant sets that are concordantly different between two biological states. Indeed, this method is an approach for finding over-represented genes associated with a phenotype. GSEA identifies either significantly enriched (top of ranked genes list) or depleted (bottom of ranked genes list) gene sets and elicits a quantitative enrichment score of over-represented gene sets at the top or bottom of the list of the ranked genes. Then, this method uses the permutation test to estimate P-values and finally normalizes the enrichment score (NES) for each gene set and adjusted P-value to multiple hypotheses testing and False Discovery Rate (FDR) presented. Single sample Gene Set Enrichment analysis (ssGSEA) 21 is a customized version of GSEA, that similar to GSEA detects biologically relevant gene sets that are over-represented (top or bottom of the ranked gene) in a single sample. The ssGSEA calculates an ES in a single sample without requiring control data. The genes for a sample were ranked (and normalized) from high to low using the Empirical Cumulative Distribution Functions (ECDF) and absolute expression values. The ES is obtained by a sum (integration) of the differences between the weighted ECDF among genes of a gene set relative to the genes that are not existent in the given set. GIMME, an algorithm for creating a consistent GEM with a desired metabolic objective The GIMME algorithm 9 weights the reactions by gene expression values and uses a threshold for minimizing low-expressed reactions as inactive while keeping the objective above a certain value. In this study, we used the adapted version of the original GIMME by S. Opdam and A. Richelle 2017 7 , in which differences in threshold and expression levels were used as weights and set as Fig. 3 Workflow of the proposed concept in contrast to the standard framework. Circle, square and hexagon symbols stand for a given pathway. Here, we depicted that in the traditional (existing) framework a whole pathway flux might be inhibited due to a single enzyme down-regulation.
objective coefficients of the reactions. The threshold was 25 percent quantile subtracted from the expression data and reactions without expression data were given a weight of −1, which prevented those reactions from deletion during model construction. We set the tolerance for the reduction of the value of the objective function to 0.9.
ssGSEA-GIMME, a combined framework for data normalization and integration To improve the accuracy of integration methods, we combined the results of ssGSEA and the GIMME methods. To categorize reactions of the yeast metabolic model with high confidence, we mapped metabolic processes to the model gene-protein-reaction (GPR) associations, using a manually curated annotation set of proteins, assigned to different metabolic processes, first introduced in ref. 22 . The ssGSEA analysis was performed using R code provided by Broad institute (Original code written by Pablo Tamayo and modified by Karsten Krug) (https://github.com/ broadinstitute/ssGSEA2.0). The minimum number of genes needed for a gene set was three. We used the area under curve ("area.under.RES") as a statistic, the z-score as correlation type, and performed 1000 permutations. In our analysis, we used a z-score cut-off value of 1.96. The scores calculated by the ssGSEA analysis were used as weights to map genes expression values to the corresponding reaction expression values, which we then fed into GIMME to construct the context specific GEM.
The genome-scale metabolic model of yeast We used the consensus genome-scale metabolic model of S. cerevisiae. Version 8.4.2 was downloaded from the project's GitHub repository (https://github.com/SysBioChalmers/yeast-GEM), which contains 4058 reactions, 2742 metabolites, and 1150 genes.
All bounds used the original flux bounds of the model, except for the experimentally determined fluxes through 'D-glucose exchange' (as measured) and 'oxygen exchange' (setting the lower bound to maximal determined uptake rate) for the first study (glucose-limited chemostats), and 'D-glucose exchange', 'ethanol exchange', 'acetate exchange', and 'maltose exchange' for the second study (carbon source-limited chemostats). We set the biomass function to 'biomass pseudoreaction' for both studies.
Omics data collection; RNA-seq and flux data We collected the omics data (transcriptome) and physiological measurements from two studies. First, in an unpublished dataset provided by Simon Hubbard (University of Manchester, UK), S. cerevisiae strain CEN.PK113-7D was grown in a chemically-defined minimal (Verduyn) medium with glucose as the main carbon source. Cells were cultivated in two independent chemostats at dilution rates spanning from D ¼ 0:20h À1 to 0:34h À1 and both supernatant and cell samples were taken for determination of exometabolite fluxes 22 , using analytical procedures described in 23 and RNA-seq analysis, as described in 24 . In this study, the sequencing was performed on the ABI SOLiD platform. The reads were aligned to the S. cerevisiae genome assembly EF4, which was downloaded from ENSEMBL, using the Bowtie version 1 software 25 .
To determine the expression values, the Reads Per Million (RPM) normalization method was applied.
In a study by Daran-Lapujade et al. 14 , S. cerevisiae strain CEN.PK113-7D was grown in a chemically-defined minimal (Verduyn) medium with one of four carbon sources (glucose, maltose, ethanol, or acetate) as the main carbon source at D ¼ 0:10h À1 . The results of transcriptomics and flux analysis were described in 14 . Microarray profiling data was taken from the GEO database, accession number GSE8895. The Microarray Analysis consisted of taking samples from chemostats, preparing the probes, and hybridizing them with Affymetrix GeneChip® microarrays. The findings for each growth scenario were derived from three separate, independently grown replicates.

Generation of context-specific GEMs
For each condition, we created two sets of context-specific GEMs: the first one constructed by deploying the standard GIMME method and the second one, deploying GIMME with transcriptomics data, enriched by ssGSEA. All models and related codes are available from the GitHub repository https://github.com/mahjalili/ ssGSEAGEM.

DATA AVAILABILITY
The experimental data from glucose-limited chemostat studies are partially available: RNA-seq data was provided by Prof. Simon Hubbard and therefore not made public in this manuscript and flux data is reported in (https://doi.org/10.1101/ 2021.06.11.448029v2). Data from different carbon source-limited chemostats is publicly available (https://doi.org/10.1074/jbc.M309578200).